!pip install pgeocode
Defaulting to user installation because normal site-packages is not writeable
Collecting pgeocode
  Downloading pgeocode-0.4.1-py3-none-any.whl (9.8 kB)
Installing collected packages: pgeocode
Successfully installed pgeocode-0.4.1
Importing libraries:
- pandas: to read the data into dataframes
- pgeocode: to get the latitude and longitude of the ZIP codes
- matplotlib: for plotting
- folium: for plotting on a map
- ticker: to control how many ticks appear in a plot
- numpy and seaborn: numerical helpers and statistical plotting
import pandas as pd
import pgeocode
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap
import numpy as np
import matplotlib.ticker as ticker
import seaborn as sns
# Reading the Regency_Data sheet into a dataframe (Zip Code as string to keep leading zeros)
regencyData = pd.read_excel('data.xlsx', sheet_name='Regency_Data', dtype={'Zip Code': str})
regencyData.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Business Unit | NaN | 15.0 | 15.0 | 15.0 | 15.0 |
| Division | NaN | Northeast | Northeast | Northeast | Northeast |
| Region | NaN | PA-DE | PA-DE | PA-DE | PA-DE |
| State | NaN | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
| City | NaN | Wayne | Wayne | Wayne | Wayne |
| ... | ... | ... | ... | ... | ... |
| 2023-04-01 00:00:00 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2023-05-01 00:00:00 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2023-06-01 00:00:00 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2023-07-01 00:00:00 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2023-08-01 00:00:00 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
298 rows × 5 columns
# Strip the "-01 00:00:00" timestamp suffix from the monthly column names
regencyData.columns = [str(col).replace('-01 00:00:00', '') for col in regencyData.columns]
print(*regencyData.columns)
Business Unit Division Region State City Zip Code Gross Leasable Area (SF) Year Constructed Year of last Completed Redev Year Acquired Control Category Expense Description 2000-01 2000-02 ... 2023-07 2023-08
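The string replace above leans on pandas rendering every monthly Timestamp exactly as "…-01 00:00:00". A more direct alternative formats genuine datetime column names and leaves everything else alone; a minimal sketch on a toy frame (not the notebook's data):

```python
import pandas as pd

# Toy stand-in for regencyData: one metadata column plus two monthly
# columns whose names are pandas Timestamps, as read_excel produces them.
df = pd.DataFrame(columns=['Business Unit',
                           pd.Timestamp('2000-01-01'),
                           pd.Timestamp('2000-02-01')])

# Format real Timestamp column names as YYYY-MM; keep the rest untouched.
df.columns = [c.strftime('%Y-%m') if isinstance(c, pd.Timestamp) else c
              for c in df.columns]
print(list(df.columns))  # ['Business Unit', '2000-01', '2000-02']
```

This avoids depending on one particular string rendering of the timestamp.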
Used describe to get summary statistics on the numerical and categorical variables:
- Gross Leasable Area (SF) is missing a few values, so those rows will be removed from the data.
- The oldest building is from 1906 and the newest from 2018. We will explore how much age affects the expenses.
- Most of the company's rows are in Florida; however, the single most frequent city is Atlanta.
regencyData.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Business Unit | 29488.0 | 258452.237317 | 1.165765e+06 | 15.00 | 629.0000 | 60706.0 | 80101.0 | 8010204.00 |
| Gross Leasable Area (SF) | 29386.0 | 128452.048288 | 1.011916e+05 | 0.00 | 78820.0000 | 106482.0 | 152294.0 | 1072259.00 |
| Year Constructed | 29488.0 | 1988.729110 | 1.719589e+01 | 1906.00 | 1979.0000 | 1990.0 | 2000.0 | 2018.00 |
| Year Acquired | 29488.0 | 2007.181158 | 7.701766e+00 | 1993.00 | 1999.0000 | 2005.0 | 2016.0 | 2021.00 |
| 2000-01 | 29488.0 | -79.443192 | 1.022161e+03 | -51688.00 | 0.0000 | 0.0 | 0.0 | 11523.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-04 | 29488.0 | -1157.239697 | 7.341190e+03 | -486572.78 | -297.9375 | 0.0 | 0.0 | 32292.47 |
| 2023-05 | 29488.0 | -1165.502560 | 7.719511e+03 | -540258.47 | -264.1175 | 0.0 | 0.0 | 66000.00 |
| 2023-06 | 29488.0 | -1185.444853 | 7.767677e+03 | -520149.82 | -208.6850 | 0.0 | 0.0 | 300000.00 |
| 2023-07 | 29488.0 | -1177.278201 | 7.493314e+03 | -526337.03 | -240.3650 | 0.0 | 0.0 | 31393.03 |
| 2023-08 | 29488.0 | -1146.933299 | 7.525011e+03 | -526337.03 | -227.7650 | 0.0 | 0.0 | 96064.35 |
288 rows × 8 columns
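Because describe() mixes the metadata columns with the 288 monthly expense columns, later steps are easier if the two groups are split. A sketch on a toy column list, assuming the columns were already renamed to YYYY-MM:

```python
import re
import pandas as pd

# Toy columns mimicking the cleaned regencyData header: metadata first,
# then monthly expense columns renamed to YYYY-MM.
df = pd.DataFrame(columns=['Business Unit', 'State', '2000-01', '2000-02', '2023-08'])

# Monthly columns all match the YYYY-MM pattern after the rename step.
month_cols = [c for c in df.columns if re.fullmatch(r'\d{4}-\d{2}', str(c))]
meta_cols = [c for c in df.columns if c not in month_cols]
print(month_cols)  # ['2000-01', '2000-02', '2023-08']
print(meta_cols)   # ['Business Unit', 'State']
```

With the real dataframe, `df[month_cols]` then isolates the time series and `df[meta_cols]` the unit attributes.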
regencyData.describe(include=['O']).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Division | 29488 | 4 | Southeast | 9737 |
| Region | 29488 | 20 | VA-MD-DC | 2786 |
| State | 29488 | 25 | Florida | 6235 |
| City | 29488 | 248 | Atlanta | 911 |
| Zip Code | 29488 | 340 | 6824 | 397 |
| Year of last Completed Redev | 14494 | 35 | 0 | 3633 |
| Control | 29330 | 3 | Controllable | 15957 |
| Category | 29488 | 4 | Operating Expenses | 21753 |
| Expense | 29488 | 51 | Life Safety | 1899 |
| Description | 29488 | 375 | INSURANCE PROPERTY | 423 |
# Counting units by the number of unique values in Business Unit
regencyData['Business Unit'].nunique()
435
- There are a lot of missing values for Year of last Completed Redev. That likely indicates the unit never needed redevelopment, so missing values will be replaced with 'Never'.
- Control is missing in some rows, so those rows will be removed.
- Gross Leasable Area (SF) is missing some values, so those rows will be removed as well.
- Every attribute is missing exactly one value, most likely due to the filter row in the Excel file.
# Counting NaN values in every column
nan_count = regencyData.isna().sum().sort_values(ascending=False)
for index, row in nan_count.items():
print(f'{index} {row}')
Year of last Completed Redev 14995
Control 159
Gross Leasable Area (SF) 103
(every remaining column: 1)
# Removing first row in dataframe
regencyData = regencyData.iloc[1:]
# Filling missing redevelopment years with 'Never'
regencyData['Year of last Completed Redev'] = regencyData['Year of last Completed Redev'].fillna('Never')
# Dropping null Control and Gross Leasable area rows
regencyData = regencyData.dropna(subset=['Control','Gross Leasable Area (SF)'])
# Re-counting NaN values per column, verifying no null values are left
nan_count = regencyData.isna().sum().sort_values(ascending=False)
for index, row in nan_count.items():
print(f'{index} {row}')
Business Unit 0
Division 0
... (every column now reports 0 missing values)
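Rather than printing every column's count, the same verification collapses into a single assertion; a sketch on a toy frame standing in for the cleaned data:

```python
import pandas as pd

# Toy frame: after the fillna/dropna steps, no NaN should remain anywhere.
df = pd.DataFrame({'Control': ['Controllable', 'Fixed'],
                   'Gross Leasable Area (SF)': [78820.0, 106482.0]})

# isna().any().any() reduces the whole frame to one boolean.
assert not df.isna().any().any(), 'unexpected missing values remain'
print('no missing values')
```

The assertion fails loudly if a later change to the cleaning steps reintroduces NaNs.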
We will create a new dataframe called unit_df that only includes unit-level information. This is done to focus on the variables related to the units themselves, without the duplicate rows created by the per-expense records.
# Getting the unique IDs in Business Unit
unitId = regencyData['Business Unit'].unique()
# Creating a new dataframe with unit-level information, without the expense rows
unit_columns = ['Business Unit', 'Division', 'Region', 'State', 'Zip Code', 'Lat', 'Long', 'Year Constructed', 'Year Acquired', 'Expenses']
unit_rows = []
for unit in unitId:
    filtered_row = regencyData[regencyData['Business Unit'] == unit]
    if not filtered_row.empty:
        # Keeping only the first row for each unique ID
        first_row = filtered_row.iloc[0]
        unit_rows.append({'Business Unit': first_row['Business Unit'], 'Division': first_row['Division'],
                          'Region': first_row['Region'], 'State': first_row['State'],
                          'Zip Code': first_row['Zip Code'], 'Lat': 0, 'Long': 0,
                          'Year Constructed': first_row['Year Constructed'],
                          'Year Acquired': first_row['Year Acquired'], 'Expenses': -1})
# Building the frame once at the end (DataFrame.append was removed in pandas 2.0)
unit_df = pd.DataFrame(unit_rows, columns=unit_columns)
# The Expenses column will be reused for various analyses; it doesn't have one fixed meaning
unit_df.head()
| | Business Unit | Division | Region | State | Zip Code | Lat | Long | Year Constructed | Year Acquired | Expenses |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | Northeast | PA-DE | Pennsylvania | 19087 | 0 | 0 | 1960.0 | 2004.0 | -1 |
| 1 | 18.0 | Southeast | Tampa | Florida | 33511 | 0 | 0 | 1986.0 | 1993.0 | -1 |
| 2 | 20.0 | Southeast | Jacksonville | Florida | 32606 | 0 | 0 | 1986.0 | 1994.0 | -1 |
| 3 | 43.0 | Southeast | Jacksonville | Florida | 32605 | 0 | 0 | 1974.0 | 1993.0 | -1 |
| 4 | 51.0 | Southeast | Tampa | Florida | 34104 | 0 | 0 | 1992.0 | 1994.0 | -1 |
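The loop above works, but pandas can build the same one-row-per-unit frame directly; a sketch on toy data (drop_duplicates keeps the first row per Business Unit, matching the iloc[0] in the loop):

```python
import pandas as pd

# Toy stand-in for regencyData: two expense rows per unit.
regency = pd.DataFrame({
    'Business Unit': [15.0, 15.0, 18.0, 18.0],
    'Division': ['Northeast', 'Northeast', 'Southeast', 'Southeast'],
    'Expense': ['Life Safety', 'Insurance', 'Life Safety', 'Insurance'],
})

# One row per unit, first occurrence kept, expense detail dropped.
unit_sketch = (regency.drop_duplicates(subset='Business Unit')
                      .drop(columns='Expense')
                      .reset_index(drop=True))
print(len(unit_sketch))  # 2
```

On the real data this avoids one dataframe filter per unit and scales much better with the number of units.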
We want to explore the effect of being in a coastal city on various expenses. First we need to check whether any units are actually near coastal areas, so we will plot them on a heatmap using the folium library.
# Getting latitude and longitude of a ZIP code
nomi = pgeocode.Nominatim('us')  # build the lookup once, not on every call

def get_lat_long(zip_code):
    location = nomi.query_postal_code(zip_code)
    # Unknown ZIP codes come back as a row of NaNs, not an empty result
    return location.latitude, location.longitude
# Example usage
latitude, longitude = get_lat_long("10001") # Replace with any ZIP code
print("Latitude:", latitude, "Longitude:", longitude)
Latitude: 40.7484 Longitude: -73.9967
# Inserting the lat and long of each ZIP code into the dataframe
# Some values come back as NaN: ZIP codes of 4 or fewer digits lost their leading zero and are not found
for index, row in unit_df.iterrows():
latitude, longitude = get_lat_long(row['Zip Code'])
unit_df.at[index, 'Lat'] = latitude
unit_df.at[index, 'Long'] = longitude
print(row['Zip Code']," Latitude:", latitude, "Longitude:", longitude)
19087  Latitude: 40.0612 Longitude: -75.3999
33511  Latitude: 27.9056 Longitude: -82.2881
... (one line per unit; ZIP codes with a stripped leading zero, such as 6824, resolve to nan)
unit_df.head()
| Business Unit | Division | Region | State | Zip Code | Lat | Long | Year Constructed | Year Acquired | Expenses | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | Northeast | PA-DE | Pennsylvania | 19087 | 40.0612 | -75.3999 | 1960.0 | 2004.0 | -1 |
| 1 | 18.0 | Southeast | Tampa | Florida | 33511 | 27.9056 | -82.2881 | 1986.0 | 1993.0 | -1 |
| 2 | 20.0 | Southeast | Jacksonville | Florida | 32606 | 29.6954 | -82.4023 | 1986.0 | 1994.0 | -1 |
| 3 | 43.0 | Southeast | Jacksonville | Florida | 32605 | 29.6785 | -82.3679 | 1974.0 | 1993.0 | -1 |
| 4 | 51.0 | Southeast | Tampa | Florida | 34104 | 26.1529 | -81.7417 | 1992.0 | 1994.0 | -1 |
unit_df_no_null = unit_df.dropna(subset=['Lat','Long'])
# Create a map instance
m = folium.Map(location=[unit_df_no_null['Lat'].mean(), unit_df_no_null['Long'].mean()], zoom_start=6)
# Create a list of lat and lon pairs
heat_data = [[row['Lat'], row['Long']] for index, row in unit_df_no_null.iterrows()]
# Create a HeatMap layer and add it to the map
HeatMap(heat_data).add_to(m)
# Save it to an HTML file
m.save('heatmap.html')
The map shows that Regency properties cluster near coastal and waterfront areas, although a fair number of malls sit well inland. Looking at the map, there is no need to write code to determine whether each individual location is near water: within a given state, the malls are either all near water or all away from it. The only possible exception is Houston, Texas, which sits close to the coast.
m
To determine whether a mall is near water, we can generalize by the state in which the mall is located. This grouping is also convenient because taxes are set at the state level. The near-water value is assigned manually for each state below.
unit_df['State'].unique()
array(['Pennsylvania', 'Florida', 'Georgia', 'North Carolina', 'Ohio',
'Delaware', 'Illinois', 'Texas', 'California', 'Colorado',
'Michigan', 'Tennessee', 'Washington', 'Virginia', 'Oregon',
'Maryland', 'Indiana', 'Massachusetts', 'Missouri', 'Minnesota',
'New York', 'Connecticut', 'South Carolina', 'New Jersey',
'District of Columbia'], dtype=object)
unit_df['Near Water'] = 0  # placeholder; assigned properly below
# Decision for each state is based on heatmap
# 'Pennsylvania' : 0
# 'Florida' : 1
# 'Georgia' : 0
# 'North Carolina' : 0
# 'Ohio' : 0
# 'Delaware' : 1
# 'Illinois' : 1
# 'Texas' : Not sure; defaulting to 0 for now
# 'California' : 1
# 'Colorado' : 0
# 'Michigan' : 0
# 'Virginia' : 0
# 'Oregon' : 1 (likely near water)
# 'Maryland' : 1
# 'Indiana' : 0
# 'Massachusetts' : -1 (possibly one of the NaN rows)
# 'Missouri' : 0
# 'Minnesota' : 0
# 'New York' : 1
# 'Connecticut' : 1 (not visible on the map, but likely coastal, following the NYC trend)
# 'South Carolina' : 1
# 'New Jersey': 0
# 'District of Columbia': 0
statesWithWater = ['Florida','Delaware', 'Illinois', 'California','Oregon', 'Maryland', 'New York', 'Connecticut', 'South Carolina']
# Flag units located in states whose stores were judged near water
unit_df['Near Water'] = unit_df['State'].isin(statesWithWater).astype(int)
unit_df.head()
| Business Unit | Division | Region | State | Zip Code | Lat | Long | Year Constructed | Year Acquired | Expenses | Near Water | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | Northeast | PA-DE | Pennsylvania | 19087 | 40.0612 | -75.3999 | 1960.0 | 2004.0 | -1 | 0 |
| 1 | 18.0 | Southeast | Tampa | Florida | 33511 | 27.9056 | -82.2881 | 1986.0 | 1993.0 | -1 | 1 |
| 2 | 20.0 | Southeast | Jacksonville | Florida | 32606 | 29.6954 | -82.4023 | 1986.0 | 1994.0 | -1 | 1 |
| 3 | 43.0 | Southeast | Jacksonville | Florida | 32605 | 29.6785 | -82.3679 | 1974.0 | 1993.0 | -1 | 1 |
| 4 | 51.0 | Southeast | Tampa | Florida | 34104 | 26.1529 | -81.7417 | 1992.0 | 1994.0 | -1 | 1 |
The analysis uses Category expenses instead of the raw Expense column. The Expense column is very detailed and varies greatly from month to month; it might be accurate for yearly predictions, but it is too volatile for monthly ones.
# Getting column names with dates (monthly columns start at index 14)
dates = regencyData.columns[14:].tolist()
# Combining expenses per month using Control column
unitExpensesPerMonth = regencyData.groupby(['Business Unit', 'Category'])[dates].sum().reset_index()
unitExpensesPerMonth.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 18.0 |
| Category | Insurance Expense | Non-CAM Expenses | Operating Expenses | Real Estate Tax Expense | Insurance Expense |
| 2000-01 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| 2000-02 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| 2000-03 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| ... | ... | ... | ... | ... | ... |
| 2023-04 | -12948.47 | -720.05 | -73458.0 | -38166.58 | -22037.25 |
| 2023-05 | -12948.47 | -482.55 | -25715.77 | -38166.58 | -22037.25 |
| 2023-06 | -18159.12 | -1933.35 | -77736.53 | -38166.59 | -31604.16 |
| 2023-07 | -14685.33 | -479.55 | -96710.95 | -45473.42 | -25226.22 |
| 2023-08 | -14685.33 | -6986.82 | -72839.18 | -39210.42 | -25226.22 |
286 rows × 5 columns
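An aside on the layout: the wide month-per-column table above is convenient for the row-wise sums used here, but a long format (one row per unit, category, and month) is often easier to model on. A minimal sketch of the reshape with `melt`, on toy data that mirrors the columns above:

```python
import pandas as pd

# Toy stand-in for unitExpensesPerMonth: id columns plus 'YYYY-MM' value columns
wide = pd.DataFrame({
    'Business Unit': [15.0],
    'Category': ['Insurance Expense'],
    '2000-01': [-4175.0],
    '2000-02': [-4175.0],
})
# One row per (unit, category, month)
long = wide.melt(id_vars=['Business Unit', 'Category'],
                 var_name='Month', value_name='Expense')
print(long.shape)  # (2, 4)
```

This is optional here; the rest of the notebook keeps the wide layout.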
unique_business_units = unitExpensesPerMonth['Business Unit'].unique().tolist()
Plotting each expense for all units over the years to explore whether there are any trends.
def get_list_category(unitId, category):
    # Return the monthly expense values for one unit/category pair
    row = unitExpensesPerMonth.loc[
        (unitExpensesPerMonth['Business Unit'] == unitId) &
        (unitExpensesPerMonth['Category'] == category)]
    return row.values.tolist()[0][2:]  # drop 'Business Unit' and 'Category'
This plot shows trends across the units. Surprisingly, some expenses are positive, which is most likely a data error.
# Create a figure
plt.figure(figsize=(10, 6))
# Loop over units and plot each as a line
for unitId in unitExpensesPerMonth['Business Unit'].unique().tolist():
    try:
        y = get_list_category(unitId, 'Insurance Expense')
        plt.plot(dates, y, label=unitId)
    except IndexError:
        # No Insurance Expense rows for this unit; print the id and move on
        print(unitId)
# Add labels and title
plt.xlabel('Month')
plt.ylabel('Insurance Expense')
plt.title('Insurance Expense per Business Unit')
plt.xticks(rotation=90)
# Rotate labels for better readability
#plt.gcf().autofmt_xdate()
# Set x-axis to show every sixth tick
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))
# Add legend
plt.legend()
# Show the plot
plt.show()
60538.0 60780.0 60837.0 60838.0 60860.0 60869.0 60876.0 60877.0 60880.0 80097.0 92053.0
The first plot is too cluttered, so the units will be plotted by state instead. This is generally better, since tax laws differ from state to state.
def plot_by_month(df, state, category, water):
    # Create a figure
    plt.figure(figsize=(10, 6))
    # Loop over the state's units and plot each as a line
    for unitId in df['Business Unit'].unique().tolist():
        try:
            if unitId not in [170.0]:  # unit 170.0 is skipped
                y = get_list_category(unitId, category)
                # Skip units with any month below -50,000 to keep the y-scale readable
                if not any(value < -50000 for value in y):
                    plt.plot(dates, y)
        except IndexError:
            # No rows for this unit/category; print the id and move on
            print(unitId)
    # Add labels and title
    plt.xlabel('Month')
    plt.ylabel(category)
    plt.title(f'{state} {category}')
    plt.xticks(rotation=90)
    # Set x-axis to show every sixth tick
    plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))
    if water:
        # Light blue background marks near-water states
        plt.gca().set_facecolor('lightblue')
    # Show the plot
    plt.show()
Based on the following plots, the observations are:
- All states show broadly similar trends.
- States marked as near water are more likely to exceed $10,000 in monthly insurance expense.
- Insurance was stable until 2020, when prices started increasing dramatically, most likely due to COVID.
- State is a very good indicator of expenses: in most graphs, the spending trends and costs within a state are correlated and clustered together.
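The observation that insurance was stable until 2020 and then rose sharply can be spot-checked numerically. A sketch on toy data; the 'YYYY-MM' index and the 2020-01 cutoff mirror the column naming above, while the values themselves are made up:

```python
import pandas as pd

# Toy monthly Insurance Expense series for one unit (expenses are stored as negatives)
series = pd.Series(
    [-4000.0] * 6 + [-9000.0] * 6,
    index=['2018-01', '2018-07', '2019-01', '2019-07', '2019-10', '2019-12',
           '2020-03', '2020-09', '2021-01', '2021-07', '2022-01', '2022-07'])

# 'YYYY-MM' strings sort lexicographically, so plain string comparison splits the periods
pre = series[series.index < '2020-01'].abs().mean()
post = series[series.index >= '2020-01'].abs().mean()
print(pre, post)  # 4000.0 9000.0
```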
# plots with light blue background are states where shops are near water
# Plotting insurance expenses
for state in unit_df['State'].unique().tolist():
df = unit_df[unit_df['State'] == state]
nearWater = df['Near Water'].iloc[0]
plot_by_month(df, state, 'Insurance Expense',nearWater)
80097.0
60780.0 60837.0 60838.0 60860.0 92053.0
60538.0
60869.0 60876.0
60877.0 60880.0
There is no apparent trend in Non-CAM Expenses. In general it seems very volatile, changing from month to month.
# plots with light blue background are states where shops are near water
# Plotting Non-CAM Expenses
for state in unit_df['State'].unique().tolist():
df = unit_df[unit_df['State'] == state]
nearWater = df['Near Water'].iloc[0]
plot_by_month(df, state, 'Non-CAM Expenses',nearWater)
The operating expenses vary from month to month but appear to increase slowly at a steady rate. Using a linear or logistic regression to predict them seems feasible.
# plots with light blue background are states where shops are near water
# Plotting Operating Expenses
for state in unit_df['State'].unique().tolist():
df = unit_df[unit_df['State'] == state]
nearWater = df['Near Water'].iloc[0]
plot_by_month(df, state, 'Operating Expenses',nearWater)
Tax expense increases slowly over the years; a linear model would be sufficient for prediction.
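Since the tax series is close to linear, a first-degree fit is enough for a one-step-ahead forecast. A sketch on synthetic data; `np.polyfit` here stands in for whatever regression routine is used later, and the slope is an assumed illustration:

```python
import numpy as np

# Synthetic monthly Real Estate Tax Expense: a steady linear drift, stored as negatives
months = np.arange(48, dtype=float)           # four years of month indices
tax = -30000.0 - 120.0 * months               # assumed slope of -120 per month

slope, intercept = np.polyfit(months, tax, 1)
next_month = slope * 48 + intercept           # predict the following month
print(round(slope, 1), round(next_month, 1))  # -120.0 -35760.0
```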
# plots with light blue background are states where shops are near water
# Plotting Real Estate Tax Expense
for state in unit_df['State'].unique().tolist():
df = unit_df[unit_df['State'] == state]
nearWater = df['Near Water'].iloc[0]
plot_by_month(df, state, 'Real Estate Tax Expense',nearWater)
60780.0 60837.0 60838.0 60860.0
60538.0
60869.0 60876.0
60709.0
60877.0 60880.0
Create a correlation heatmap to determine the effect of construction year on expenses.
averageExpenses = unitExpensesPerMonth.copy()
averageExpenses.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 18.0 |
| Category | Insurance Expense | Non-CAM Expenses | Operating Expenses | Real Estate Tax Expense | Insurance Expense |
| 2000-01 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| 2000-02 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| 2000-03 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| ... | ... | ... | ... | ... | ... |
| 2023-04 | -12948.47 | -720.05 | -73458.0 | -38166.58 | -22037.25 |
| 2023-05 | -12948.47 | -482.55 | -25715.77 | -38166.58 | -22037.25 |
| 2023-06 | -18159.12 | -1933.35 | -77736.53 | -38166.59 | -31604.16 |
| 2023-07 | -14685.33 | -479.55 | -96710.95 | -45473.42 | -25226.22 |
| 2023-08 | -14685.33 | -6986.82 | -72839.18 | -39210.42 | -25226.22 |
286 rows × 5 columns
Getting the average expense for each unit/category pair over all the years.
# Average expense across all monthly columns (NaNs propagate, matching a row-wise sum)
averageExpenses['Average Expenses'] = averageExpenses[dates].sum(axis=1, skipna=False) / len(dates)
# Look up 'Year Constructed' from unit_df for each Business Unit
# (first match wins when a unit appears more than once; no match yields NaN)
yearLookup = unit_df.drop_duplicates('Business Unit').set_index('Business Unit')['Year Constructed']
averageExpenses['Year Constructed'] = averageExpenses['Business Unit'].map(yearLookup)
averageExpenses1 = averageExpenses[['Year Constructed', 'Category','Average Expenses']]
Surprisingly, the year of construction has a low correlation with the average value of each expense category.
df = averageExpenses1[averageExpenses1['Category'] == 'Insurance Expense']
corr = df.corr(numeric_only=True)  # 'Category' is non-numeric, so exclude it explicitly
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap: Insurance Expense')
plt.show()
df = averageExpenses1[averageExpenses1['Category'] == 'Non-CAM Expenses']
corr = df.corr(numeric_only=True)  # 'Category' is non-numeric, so exclude it explicitly
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap: Non-CAM Expenses')
plt.show()
df = averageExpenses1[averageExpenses1['Category'] == 'Operating Expenses']
corr = df.corr(numeric_only=True)  # 'Category' is non-numeric, so exclude it explicitly
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap: Operating Expenses')
plt.show()
df = averageExpenses1[averageExpenses1['Category'] == 'Real Estate Tax Expense']
corr = df.corr(numeric_only=True)  # 'Category' is non-numeric, so exclude it explicitly
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap: Real Estate Tax Expense')
plt.show()
A gas price dataset was found on the U.S. Energy Information Administration site. It contains the average petrol price per state. It currently covers only 9 states, but they span all regions of the US:
- California
- Colorado
- Florida
- Massachusetts
- Minnesota
- New York
- Ohio
- Texas
- Washington
Reading gas prices so they can be used to see how they might impact expenses. If the state we want to predict is not in this list, predictions will be based on the nearest state that is.
Link: https://www.eia.gov/dnav/pet/pet_pri_gnd_a_epm0_pte_dpgal_w.htm
After searching online for retail gasoline prices per state, a better dataset was found and downloaded from the following link: https://www.eia.gov/dnav/pet/PET_PRI_REFMG_A_EPM0_PWG_DPGAL_M.htm
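The nearest-state fallback mentioned above can be sketched with rough state centroids. Everything in this snippet is illustrative: the centroid coordinates are approximate, and `covered` stands in for the states actually present in the gas dataset:

```python
import math

# Approximate state centroids (lat, lon) -- illustrative values, not from the dataset
centroids = {
    'California': (36.7, -119.4),
    'Texas': (31.0, -99.9),
    'Florida': (27.8, -81.7),
    'New York': (42.9, -75.5),
    'Nevada': (39.3, -116.6),
}
covered = ['California', 'Texas', 'Florida', 'New York']  # states with gas data

def nearest_covered_state(state):
    # Use the state's own prices when available, else the geographically closest covered state
    if state in covered:
        return state
    lat, lon = centroids[state]
    return min(covered, key=lambda s: math.hypot(centroids[s][0] - lat,
                                                 centroids[s][1] - lon))

print(nearest_covered_state('Nevada'))  # California
```

Planar distance on raw lat/lon is crude, but it is fine for picking a neighboring state at this scale.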
gasData = pd.read_excel('States_gas.xls', sheet_name='Data 1', header = 1)
gasData.head()
| Sourcekey | EMA_EPM0_PWG_NUS_DPG | EMA_EPM0_PWG_SCT_DPG | … | EMA_EPM0_PWG_SOR_DPG | EMA_EPM0_PWG_SWA_DPG | |
|---|---|---|---|---|---|---|
| 0 | Date | U.S. Total Gasoline Wholesale/Resale Price by ... | Connecticut Total Gasoline Wholesale/Resale Pr... | … | Oregon Total Gasoline Wholesale/Resale Price b... | Washington Total Gasoline Wholesale/Resale Pri... |
| 1 | 1983-01-15 00:00:00 | 0.885 | 0.957 | … | 0.828 | 0.836 |
| 2 | 1983-02-15 00:00:00 | 0.854 | 0.917 | … | 0.753 | 0.765 |
| 3 | 1983-03-15 00:00:00 | 0.829 | 0.88 | … | 0.718 | 0.719 |
| 4 | 1983-04-15 00:00:00 | 0.865 | 0.899 | … | 0.755 | 0.739 |

5 rows × 61 columns (columns truncated; row 0 holds the human-readable series names, one EIA series code per region/state)
unitExpensesPerMonthWithGas = unitExpensesPerMonth.copy()
# Adding the state of each Unit so we can combine the prices of gas
unitExpensesPerMonthWithGas = unitExpensesPerMonthWithGas.merge(unit_df[['Business Unit', 'State']], on='Business Unit', how='left')
import pandas as pd
def get_substring_before_total(s):
# Find the index of ' Total'
index = s.find(' Total')
# If 'Total' is not found, return the original string
if index == -1:
return s
# Return the substring up to the index of 'Total'
return s[:index]
# Applying the function to each cell in the first row
for col in gasData.columns:
gasData.at[0, col] = get_substring_before_total(gasData.at[0, col])
print(gasData)
               Sourcekey  ... EMA_EPM0_PWG_SWA_DPG
0                   Date  ...           Washington
1    1983-01-15 00:00:00  ...                0.836
2    1983-02-15 00:00:00  ...                0.765
3    1983-03-15 00:00:00  ...                0.719
4    1983-04-15 00:00:00  ...                0.739
..                   ...  ...                  ...
467  2021-11-15 00:00:00  ...                2.522
468  2021-12-15 00:00:00  ...                 2.46
469  2022-01-15 00:00:00  ...                2.625
470  2022-02-15 00:00:00  ...                2.657
471  2022-03-15 00:00:00  ...                3.275

[472 rows x 61 columns]
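To see what the header-trimming helper does in isolation, here is a self-contained restatement with sample values (the function body is copied from above so the snippet runs on its own):

```python
def get_substring_before_total(s):
    # Keep everything before ' Total'; return the string unchanged if it is absent
    i = s.find(' Total')
    return s if i == -1 else s[:i]

print(get_substring_before_total(
    'Washington Total Gasoline Wholesale/Resale Price by Refiners'))  # Washington
print(get_substring_before_total('Date'))  # Date
```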
# Set the first row as the header
gasData.columns = gasData.iloc[0]
# Drop the first row
gasData = gasData.drop(gasData.index[0])
# Reset the index if needed
gasData = gasData.reset_index(drop=True)
# Convert to datetime
gasData['Date'] = pd.to_datetime(gasData['Date'])
# Format the date as 'YYYY-MM'
gasData['Date'] = gasData['Date'].dt.strftime('%Y-%m')
gasData.head()
| Date | U.S. | East Coast (PADD 1) | New England (PADD 1A) | Connecticut | Maine | Massachusetts | New Hampshire | Rhode Island | Vermont | Central Atlantic (PADD 1B) | Delaware | District of Columbia | Maryland | New Jersey | New York | Pennsylvania | Lower Atlantic (PADD 1C) | Florida | Georgia | North Carolina | South Carolina | Virginia | West Virginia | Midwest (PADD 2) | Illinois | Indiana | Iowa | Kansas | Kentucky | Michigan | Minnesota | Missouri | Nebraska | North Dakota | Ohio | Oklahoma | South Dakota | Tennessee | Wisconsin | Gulf Coast (PADD 3) | Alabama | Arkansas | Louisiana | Mississippi | New Mexico | Texas | Rocky Mountain (PADD 4) | Colorado | Idaho | Montana | Utah | Wyoming | West Coast (PADD 5) | Alaska | Arizona | California | Hawaii | Nevada | Oregon | Washington | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1983-01 | 0.885 | 0.91 | 0.944 | 0.957 | 0.918 | 0.939 | 0.959 | 0.942 | 0.994 | 0.92 | 0.859 | 0.932 | 0.907 | 0.911 | 0.964 | 0.901 | 0.889 | 0.907 | 0.877 | 0.866 | 0.873 | 0.896 | 0.944 | 0.888 | 0.896 | 0.896 | 0.892 | 0.847 | 0.895 | 0.906 | 0.894 | 0.869 | 0.881 | 0.915 | 0.914 | 0.855 | 0.896 | 0.877 | 0.895 | 0.861 | 0.871 | 0.869 | 0.876 | 0.852 | 0.916 | 0.854 | 0.891 | 0.865 | 0.937 | 0.887 | 0.913 | 0.904 | 0.861 | 1.019 | 0.869 | 0.859 | 1.018 | 0.896 | 0.828 | 0.836 |
| 1 | 1983-02 | 0.854 | 0.881 | 0.91 | 0.917 | 0.888 | 0.912 | 0.921 | 0.902 | 0.945 | 0.882 | 0.831 | 0.903 | 0.872 | 0.871 | 0.925 | 0.858 | 0.871 | 0.884 | 0.864 | 0.858 | 0.861 | 0.869 | 0.897 | 0.864 | 0.873 | 0.876 | 0.865 | 0.829 | 0.873 | 0.882 | 0.857 | 0.847 | 0.853 | 0.857 | 0.887 | 0.836 | 0.861 | 0.862 | 0.865 | 0.845 | 0.859 | 0.849 | 0.86 | 0.846 | 0.881 | 0.838 | 0.826 | 0.828 | 0.846 | 0.816 | 0.811 | 0.841 | 0.807 | 0.992 | 0.814 | 0.811 | 0.926 | 0.846 | 0.753 | 0.765 |
| 2 | 1983-03 | 0.829 | 0.856 | 0.878 | 0.88 | 0.855 | 0.881 | 0.894 | 0.875 | 0.906 | 0.849 | 0.803 | 0.886 | 0.854 | 0.843 | 0.895 | 0.815 | 0.856 | 0.869 | 0.853 | 0.844 | 0.845 | 0.854 | 0.866 | 0.839 | 0.841 | 0.847 | 0.843 | 0.808 | 0.85 | 0.857 | 0.836 | 0.827 | 0.831 | 0.834 | 0.854 | 0.815 | 0.833 | 0.843 | 0.837 | 0.821 | 0.844 | 0.821 | 0.84 | 0.826 | 0.833 | 0.814 | 0.793 | 0.813 | 0.782 | 0.774 | 0.772 | 0.802 | 0.778 | 0.891 | 0.763 | 0.793 | 0.885 | 0.808 | 0.718 | 0.719 |
| 3 | 1983-04 | 0.865 | 0.884 | 0.901 | 0.899 | 0.901 | 0.901 | 0.914 | 0.901 | 0.916 | 0.873 | 0.845 | 0.914 | 0.88 | 0.872 | 0.915 | 0.829 | 0.892 | 0.896 | 0.889 | 0.89 | 0.887 | 0.894 | 0.879 | 0.886 | 0.887 | 0.887 | 0.9 | 0.865 | 0.885 | 0.901 | 0.891 | 0.879 | 0.884 | 0.877 | 0.895 | 0.868 | 0.881 | 0.884 | 0.89 | 0.854 | 0.875 | 0.872 | 0.865 | 0.865 | 0.853 | 0.847 | 0.828 | 0.857 | 0.794 | 0.816 | 0.793 | 0.833 | 0.816 | 0.915 | 0.787 | 0.836 | 0.878 | 0.831 | 0.755 | 0.739 |
| 4 | 1983-05 | 0.904 | 0.912 | 0.932 | 0.926 | 0.922 | 0.939 | 0.944 | 0.927 | 0.954 | 0.897 | 0.848 | 0.944 | 0.911 | 0.894 | 0.945 | 0.859 | 0.922 | 0.93 | 0.915 | 0.917 | 0.918 | 0.927 | 0.9 | 0.919 | 0.923 | 0.916 | 0.937 | 0.898 | 0.917 | 0.931 | 0.928 | 0.915 | 0.921 | 0.914 | 0.916 | 0.902 | 0.918 | 0.917 | 0.925 | 0.883 | 0.907 | 0.913 | 0.895 | 0.887 | 0.88 | 0.875 | 0.87 | 0.898 | 0.838 | 0.856 | 0.847 | 0.87 | 0.895 | 0.853 | 0.858 | 0.916 | 0.911 | 0.906 | 0.84 | 0.828 |
unitExpensesPerMonthWithGas.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 18.0 |
| Category | Insurance Expense | Non-CAM Expenses | Operating Expenses | Real Estate Tax Expense | Insurance Expense |
| 2000-01 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| 2000-02 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| 2000-03 | 0.0 | 0.0 | 0.0 | 0.0 | -4175.0 |
| ... | ... | ... | ... | ... | ... |
| 2023-05 | -12948.47 | -482.55 | -25715.77 | -38166.58 | -22037.25 |
| 2023-06 | -18159.12 | -1933.35 | -77736.53 | -38166.59 | -31604.16 |
| 2023-07 | -14685.33 | -479.55 | -96710.95 | -45473.42 | -25226.22 |
| 2023-08 | -14685.33 | -6986.82 | -72839.18 | -39210.42 | -25226.22 |
| State | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Florida |
287 rows × 5 columns
unitExpensesPerMonthWithGas.isnull().sum()
Business Unit 0
Category 0
2000-01 0
2000-02 0
2000-03 0
..
2023-05 0
2023-06 0
2023-07 0
2023-08 0
State 0
Length: 287, dtype: int64
len(unitExpensesPerMonthWithGas['Business Unit'].unique())
431
Merging gas prices with unit monthly expenses to determine the effect of gas price on each category of spending.
# Build one 'Petrol' row per business unit from gasData, then append them all at once
petrol_rows = []
for unit in unitExpensesPerMonthWithGas['Business Unit'].unique():
    # Get the state for the current unit
    state = unitExpensesPerMonthWithGas[unitExpensesPerMonthWithGas['Business Unit'] == unit]['State'].iloc[0]
    if state not in gasData.columns:
        print(f"State {state} not found in gasData columns")
        continue
    # Initialize the new row with constant values
    new_row = {'Business Unit': unit, 'State': state, 'Category': 'Petrol'}
    # Flag to check if a matching date is found
    match_found = False
    # Iterate through each year-month column
    for ym_col in unitExpensesPerMonthWithGas.columns[unitExpensesPerMonthWithGas.columns.str.match(r'\d{4}-\d{2}')]:
        # Find the matching row in gasData based on the date
        matching_row = gasData[gasData['Date'] == ym_col]
        if not matching_row.empty:
            # Retrieve the price for the state from the matching row
            new_row[ym_col] = matching_row[state].iloc[0]
            match_found = True
    # Keep the new row only if at least one month matched
    if match_found:
        petrol_rows.append(new_row)
# Append all Petrol rows in one go (DataFrame.append is deprecated; use concat)
unitExpensesPerMonthWithGas = pd.concat(
    [unitExpensesPerMonthWithGas, pd.DataFrame(petrol_rows)], ignore_index=True)
print(unitExpensesPerMonthWithGas)
Business Unit Category ... 2023-08 State
0 15.0 Insurance Expense ... -14685.33 Pennsylvania
1 15.0 Non-CAM Expenses ... -6986.82 Pennsylvania
2 15.0 Operating Expenses ... -72839.18 Pennsylvania
3 15.0 Real Estate Tax Expense ... -39210.42 Pennsylvania
4 18.0 Insurance Expense ... -25226.22 Florida
... ... ... ... ... ...
2129 8001801.0 Petrol ... NaN Georgia
2130 8010201.0 Petrol ... NaN Connecticut
2131 8010202.0 Petrol ... NaN Connecticut
2132 8010203.0 Petrol ... NaN Connecticut
2133 8010204.0 Petrol ... NaN Connecticut
[2134 rows x 287 columns]
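Note that `DataFrame.append` was removed in pandas 2.0; `pd.concat` with a one-row DataFrame is the replacement. A minimal sketch of the row-append pattern used above, with toy column names for illustration:

```python
import pandas as pd

# Toy frame standing in for unitExpensesPerMonthWithGas
df = pd.DataFrame({'Business Unit': [15.0], 'Category': ['Insurance Expense'], '2023-08': [-14685.33]})
new_row = {'Business Unit': 15.0, 'Category': 'Petrol', '2023-08': 3.45}

# pd.concat with a one-row DataFrame replaces the removed DataFrame.append
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df.shape)  # (2, 3)
```

Missing columns in the appended row are filled with NaN, matching the behavior the Petrol rows show above.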
sorted_df = unitExpensesPerMonthWithGas.sort_values(by='Business Unit')
print(sorted_df.isnull().sum())
Business Unit 0
Category 0
2000-01 0
2000-02 0
2000-03 0
...
2023-05 431
2023-06 431
2023-07 431
2023-08 431
State 0
Length: 287, dtype: int64
Filling null values with 0 so the data can be plotted.
sorted_df = sorted_df.fillna(0)
From the plot below, the price per gallon is so small that it barely registers on the plot. We will multiply the price per gallon by a scaling factor so its effect relative to the other values becomes visible.
import pandas as pd
import matplotlib.pyplot as plt
# Assuming sorted_df is your DataFrame and it's already sorted by 'Business Unit'
# Identify columns that match the 'YYYY-MM' pattern and sort them
date_columns = sorted(sorted_df.columns[sorted_df.columns.str.match(r'\d{4}-\d{2}')])
# Iterate over each unique value in the 'Business Unit' column
for unit in sorted_df['Business Unit'].unique():
# Selecting the relevant rows for the unit
unit_df = sorted_df[sorted_df['Business Unit'] == unit]
plt.figure() # Create a new figure for each Business Unit
# Iterate over each unique category within the unit
for category in unit_df['Category'].unique():
# Selecting the relevant rows for the category
category_df = unit_df[unit_df['Category'] == category]
# Plotting the line for the category
plt.plot(date_columns, category_df[date_columns].iloc[0], label=category)
# Add legends, labels, and title for each Business Unit
plt.legend()
plt.xlabel('Date')
plt.ylabel('Values')
plt.title(f'Line Plot for Business Unit {unit} by Date')
#plt.xticks(ticks=range(len(date_columns)), labels=date_columns, rotation=45)
plt.xticks(rotation=90)
# Rotate labels for better readability
#plt.gcf().autofmt_xdate()
# Set x-axis to show every third tick
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))
# Show the plot for each Business Unit
plt.show()
break
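The date-column selection above relies on `Index.str.match`; a raw string avoids the invalid-escape-sequence warning newer Pythons emit for `'\d'`. A self-contained sketch with toy column names:

```python
import pandas as pd

cols = pd.Index(['Business Unit', 'Category', '2000-01', '2023-08', 'State'])
# Raw string keeps \d as a regex digit class without an escape warning
date_cols = sorted(cols[cols.str.match(r'\d{4}-\d{2}')])
print(date_cols)  # ['2000-01', '2023-08']
```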
df = sorted_df.copy()
sorted_df.head().T
| 0 | 3 | 1703 | 1 | 2 | |
|---|---|---|---|---|---|
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| Category | Insurance Expense | Real Estate Tax Expense | Petrol | Non-CAM Expenses | Operating Expenses |
| 2000-01 | 0.0 | 0.0 | 0.776 | 0.0 | 0.0 |
| 2000-02 | 0.0 | 0.0 | 0.845 | 0.0 | 0.0 |
| 2000-03 | 0.0 | 0.0 | 0.963 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 2023-05 | -12948.47 | -38166.58 | 0.0 | -482.55 | -25715.77 |
| 2023-06 | -18159.12 | -38166.59 | 0.0 | -1933.35 | -77736.53 |
| 2023-07 | -14685.33 | -45473.42 | 0.0 | -479.55 | -96710.95 |
| 2023-08 | -14685.33 | -39210.42 | 0.0 | -6986.82 | -72839.18 |
| State | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
287 rows × 5 columns
The multiplier is negative so petrol moves in the same direction as the (negative) expense values, making trends easy to compare, and large enough that the petrol line does not shrink to invisibility on the plot.
sorted_df = df.copy()
# getting date columns
date_columns = sorted(sorted_df.columns[sorted_df.columns.str.match(r'\d{4}-\d{2}')])
#Multiplying petrol values by a multiplier
multiplier = -10000
# Select rows where Category is 'Petrol' and multiply
sorted_df.loc[sorted_df['Category'] == 'Petrol', date_columns] *= multiplier
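The boolean-mask `.loc` assignment above scales only the Petrol rows in place. A toy example of the same pattern (illustrative data, not the real frame):

```python
import pandas as pd

toy = pd.DataFrame({
    'Category': ['Petrol', 'Operating Expenses'],
    '2000-01': [0.776, -2127.56],
})
multiplier = -10000
# Scale only the rows where Category == 'Petrol'; other rows are untouched
toy.loc[toy['Category'] == 'Petrol', ['2000-01']] *= multiplier
print(toy)
```

The Petrol value becomes roughly -7760 while the Operating Expenses row keeps its original value.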
Plotted only 20 units, as plotting all of them breaks the notebook due to its size.
Looking at the plots and the amplified price of gasoline, we can deduce the following:
- Gasoline price has a moderate effect on Insurance Expense.
- Gasoline price has no effect on Real Estate Tax Expense.
- Gasoline price has no effect on Non-CAM Expenses.
- Gasoline price has a strong effect on Operating Expenses.
- Operating Expenses rise with gasoline prices but do not fall with them. This can be attributed to businesses raising prices to adjust to increasing petrol prices but rarely lowering them afterwards.
import pandas as pd
import matplotlib.pyplot as plt
# Create a dictionary that maps categories to colors
category_colors = {
'Insurance Expense': 'blue',
'Real Estate Tax Expense': 'green',
'Petrol': 'black',
'Non-CAM Expenses': 'orange',
'Operating Expenses': 'red',
# Add more categories and their corresponding colors as needed
}
# Identify columns that match the 'YYYY-MM' pattern and sort them
date_columns = sorted(sorted_df.columns[sorted_df.columns.str.match(r'\d{4}-\d{2}')])
count = 0
# Iterate over each unique value in the 'Business Unit' column
for unit in sorted_df['Business Unit'].unique():
# plotting every unit breaks the notebook, so stop after 20
count = count + 1
if count > 20:
break
# Selecting the relevant rows for the unit
unit_df = sorted_df[sorted_df['Business Unit'] == unit]
plt.figure() # Create a new figure for each Business Unit
# Iterate over each unique category within the unit
for category in unit_df['Category'].unique():
# Selecting the relevant rows for the category
category_df = unit_df[unit_df['Category'] == category]
# Use the dictionary to retrieve the color for the category
color = category_colors.get(category, 'black')
# Plotting the line for the category with the specified color
plt.plot(date_columns, category_df[date_columns].iloc[0], label=category, color=color)
# Add legends, labels, and title for each Business Unit
plt.legend()
plt.xlabel('Date')
plt.ylabel('Values')
plt.title(f'Line Plot for Business Unit {unit} by Date')
#plt.xticks(ticks=range(len(date_columns)), labels=date_columns, rotation=45)
plt.xticks(rotation=90)
# Rotate labels for better readability
#plt.gcf().autofmt_xdate()
# Set x-axis to show every third tick
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))
# Show the plot for each Business Unit
plt.show()
We will calculate seasonality and trend for each of the categories
trend_df = sorted_df.drop(columns=['State', 'Business Unit'])
date_columns = sorted(df.columns[df.columns.str.match(r'\d{4}-\d{2}')])
#Getting mean of the expenses for all units per month by category
trend_df = trend_df.groupby('Category')[date_columns].mean()
trend_df = trend_df.T
trend_df.head()
| Category | Insurance Expense | Non-CAM Expenses | Operating Expenses | Petrol | Real Estate Tax Expense |
|---|---|---|---|---|---|
| 2000-01 | -251.157143 | -105.191740 | -2127.562668 | -7999.234339 | -3014.519169 |
| 2000-02 | -251.157143 | -230.288561 | -2627.131647 | -8887.285383 | -3340.863064 |
| 2000-03 | -251.157143 | -187.583782 | -2407.397541 | -10119.628770 | -3164.363040 |
| 2000-04 | -251.157143 | -206.295824 | -2429.281160 | -9181.322506 | -2831.686532 |
| 2000-05 | -251.157143 | -213.932807 | -2527.201392 | -9897.795824 | -3126.370784 |
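The groupby-mean-transpose step above yields one column per category and one row per month. A toy version of the same reshaping (illustrative values):

```python
import pandas as pd

toy = pd.DataFrame({
    'Category': ['Petrol', 'Petrol', 'Operating Expenses'],
    '2000-01': [1.0, 3.0, -100.0],
    '2000-02': [2.0, 4.0, -200.0],
})
# Mean per category for each month, then transpose so months become rows
monthly_mean = toy.groupby('Category')[['2000-01', '2000-02']].mean().T
print(monthly_mean['Petrol'].tolist())  # [2.0, 3.0]
```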
# packages for time series analysis
from statsmodels.graphics import tsaplots
import statsmodels.api as sm
def plot_trends(trend):
trend_df.index = pd.to_datetime(trend_df.index)
# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(trend_df[trend])
# Plot decomposition
fig = decomposition.plot()
# Plot settings
%config InlineBackend.figure_format='retina'
plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = 12, 9 # Figure size (width,height)
plt.xlabel("Date", fontsize=12) # Text and size of xlabel
plt.suptitle(
"Daily Price Time Series Decomposition", # Text of title
y=1.05, # Position of title
size=15, # Size of title
)
plt.show()
import matplotlib.dates as mdates
def plot_trends(trend):
# Ensure the index is a datetime object
trend_df.index = pd.to_datetime(trend_df.index)
# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(trend_df[trend], model='additive')
# Plot decomposition
fig = decomposition.plot()
# Improve plot aesthetics and settings
plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = (12, 9) # Figure size (width,height)
# Adjusting x-axis to display labels every 6 months
plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=6)) # Set major ticks to every 6 months
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%b %Y')) # Format ticks to show month and year
# Rotate the tick labels for x-axis
for label in plt.gca().get_xticklabels():
label.set_rotation(90)
plt.xlabel("Date", fontsize=12) # Text and size of xlabel
plt.suptitle(
"Daily Price Time Series Decomposition", # Text of title
y=1.05, # Position of title
size=15, # Size of title
)
plt.show()
Based on the trend and seasonality analysis, we infer the following:
- For all expense categories there is an increasing trend in the value of expenses.
- All expenses show a degree of seasonality, with the pattern repeating around December and June: expenses usually dip in June and peak in December.
- Insurance Expense and Real Estate Tax have low residuals, while Non-CAM Expenses and Operating Expenses have high variance in their residuals.
- Petrol shows seasonality around the same time as the expenses, but there is no obvious trend in its pricing. (This is the amplified petrol price.)
arr = ['Insurance Expense','Non-CAM Expenses','Operating Expenses','Real Estate Tax Expense','Petrol']
for val in arr:
plot_trends(val)
corr = trend_df.corr()
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for numerical Values')
plt.show()
trend_df.head()
| Category | Insurance Expense | Non-CAM Expenses | Operating Expenses | Petrol | Real Estate Tax Expense |
|---|---|---|---|---|---|
| 2000-01-01 | -251.157143 | -105.191740 | -2127.562668 | -7999.234339 | -3014.519169 |
| 2000-02-01 | -251.157143 | -230.288561 | -2627.131647 | -8887.285383 | -3340.863064 |
| 2000-03-01 | -251.157143 | -187.583782 | -2407.397541 | -10119.628770 | -3164.363040 |
| 2000-04-01 | -251.157143 | -206.295824 | -2429.281160 | -9181.322506 | -2831.686532 |
| 2000-05-01 | -251.157143 | -213.932807 | -2527.201392 | -9897.795824 | -3126.370784 |
Based on the findings so far, a machine learning model can be created to predict Regency Square's expenses, for the following reasons:
- Insurance Expense:
- Based on the plots and residuals, a linear regression model might be sufficient for predicting the expenses. However, the sudden spikes in the expenses cannot be explained by my current knowledge of the business data.
- Non-CAM Expenses:
- For the same reasons as Insurance Expense, a linear regression model might be sufficient for predicting expenses.
- Operating Expenses:
- Operating Expenses are by far the hardest to predict due to high variance. Despite the correlation heatmap, we believe an increase in petrol price drives these expenses up. The low correlation likely comes from expenses not falling when petrol prices fall, and from the slow adaptation of expenses to petrol prices.
- Real Estate Tax:
- We cannot identify the underlying patterns in the data or the reason for the frequent spikes in expenses. We attribute this to our limited knowledge of the business.
Based on the correlation heatmap, all the expense categories are strongly related. Thus, when building a machine learning model, the expenses of the other categories should be used as inputs.
WIP 3
We will work on creating a model to predict the expenses based on the extracted attributes
sorted_df.head().T
| 0 | 3 | 1703 | 1 | 2 | |
|---|---|---|---|---|---|
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| Category | Insurance Expense | Real Estate Tax Expense | Petrol | Non-CAM Expenses | Operating Expenses |
| 2000-01 | 0.0 | 0.0 | -7760.0 | 0.0 | 0.0 |
| 2000-02 | 0.0 | 0.0 | -8450.0 | 0.0 | 0.0 |
| 2000-03 | 0.0 | 0.0 | -9630.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 2023-05 | -12948.47 | -38166.58 | -0.0 | -482.55 | -25715.77 |
| 2023-06 | -18159.12 | -38166.59 | -0.0 | -1933.35 | -77736.53 |
| 2023-07 | -14685.33 | -45473.42 | -0.0 | -479.55 | -96710.95 |
| 2023-08 | -14685.33 | -39210.42 | -0.0 | -6986.82 | -72839.18 |
| State | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
287 rows × 5 columns
unit_df.head()
| Business Unit | Division | Region | State | Zip Code | Lat | Long | Year Constructed | Year Acquired | Expenses | Near Water | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | Northeast | PA-DE | Pennsylvania | 19087 | 40.0612 | -75.3999 | 1960.0 | 2004.0 | -1 | 0 |
| 1 | 18.0 | Southeast | Tampa | Florida | 33511 | 27.9056 | -82.2881 | 1986.0 | 1993.0 | -1 | 1 |
| 2 | 20.0 | Southeast | Jacksonville | Florida | 32606 | 29.6954 | -82.4023 | 1986.0 | 1994.0 | -1 | 1 |
| 3 | 43.0 | Southeast | Jacksonville | Florida | 32605 | 29.6785 | -82.3679 | 1974.0 | 1993.0 | -1 | 1 |
| 4 | 51.0 | Southeast | Tampa | Florida | 34104 | 26.1529 | -81.7417 | 1992.0 | 1994.0 | -1 | 1 |
# Merging dataframes to combine the attributes into a single dataframe
merged_df = sorted_df.merge(unit_df[['Business Unit', 'State', 'Year Constructed', 'Near Water']],
on='Business Unit',
how='left')
merged_df.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| Category | Insurance Expense | Real Estate Tax Expense | Petrol | Non-CAM Expenses | Operating Expenses |
| 2000-01 | 0.0 | 0.0 | -7760.0 | 0.0 | 0.0 |
| 2000-02 | 0.0 | 0.0 | -8450.0 | 0.0 | 0.0 |
| 2000-03 | 0.0 | 0.0 | -9630.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 2023-08 | -14685.33 | -39210.42 | -0.0 | -6986.82 | -72839.18 |
| State_x | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
| State_y | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
| Year Constructed | 1960.0 | 1960.0 | 1960.0 | 1960.0 | 1960.0 |
| Near Water | 0 | 0 | 0 | 0 | 0 |
290 rows × 5 columns
# Moving the last 3 columns to the beginning
columns = merged_df.columns.tolist()
# Reorder the columns to move the last three to the beginning
new_columns = columns[-3:] + columns[:-3]
# Create a new DataFrame with the reordered columns
merged_df = merged_df[new_columns]
# Display the updated DataFrame
merged_df.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| State_y | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
| Year Constructed | 1960.0 | 1960.0 | 1960.0 | 1960.0 | 1960.0 |
| Near Water | 0 | 0 | 0 | 0 | 0 |
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| Category | Insurance Expense | Real Estate Tax Expense | Petrol | Non-CAM Expenses | Operating Expenses |
| ... | ... | ... | ... | ... | ... |
| 2023-05 | -12948.47 | -38166.58 | -0.0 | -482.55 | -25715.77 |
| 2023-06 | -18159.12 | -38166.59 | -0.0 | -1933.35 | -77736.53 |
| 2023-07 | -14685.33 | -45473.42 | -0.0 | -479.55 | -96710.95 |
| 2023-08 | -14685.33 | -39210.42 | -0.0 | -6986.82 | -72839.18 |
| State_x | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
290 rows × 5 columns
# Removing duplicate column
merged_df = merged_df.drop(columns=['State_x'])
merged_df.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| State_y | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania | Pennsylvania |
| Year Constructed | 1960.0 | 1960.0 | 1960.0 | 1960.0 | 1960.0 |
| Near Water | 0 | 0 | 0 | 0 | 0 |
| Business Unit | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| Category | Insurance Expense | Real Estate Tax Expense | Petrol | Non-CAM Expenses | Operating Expenses |
| ... | ... | ... | ... | ... | ... |
| 2023-04 | -12948.47 | -38166.58 | -0.0 | -720.05 | -73458.0 |
| 2023-05 | -12948.47 | -38166.58 | -0.0 | -482.55 | -25715.77 |
| 2023-06 | -18159.12 | -38166.59 | -0.0 | -1933.35 | -77736.53 |
| 2023-07 | -14685.33 | -45473.42 | -0.0 | -479.55 | -96710.95 |
| 2023-08 | -14685.33 | -39210.42 | -0.0 | -6986.82 | -72839.18 |
289 rows × 5 columns
# Creating a date column and getting rid of all the year-month columns
df_melted = pd.melt(merged_df, id_vars=['State_y', 'Year Constructed', 'Near Water', 'Business Unit', 'Category'],
var_name='Date', value_name='Value')
df_melted.head()
| State_y | Year Constructed | Near Water | Business Unit | Category | Date | Value | |
|---|---|---|---|---|---|---|---|
| 0 | Pennsylvania | 1960.0 | 0 | 15.0 | Insurance Expense | 2000-01 | 0.0 |
| 1 | Pennsylvania | 1960.0 | 0 | 15.0 | Real Estate Tax Expense | 2000-01 | 0.0 |
| 2 | Pennsylvania | 1960.0 | 0 | 15.0 | Petrol | 2000-01 | -7760.0 |
| 3 | Pennsylvania | 1960.0 | 0 | 15.0 | Non-CAM Expenses | 2000-01 | 0.0 |
| 4 | Pennsylvania | 1960.0 | 0 | 15.0 | Operating Expenses | 2000-01 | 0.0 |
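`pd.melt` turns the many year-month columns into a single `Date` column with one value per row. A toy example of the same wide-to-long reshape (illustrative data):

```python
import pandas as pd

wide = pd.DataFrame({
    'Business Unit': [15.0],
    'Category': ['Petrol'],
    '2000-01': [-7760.0],
    '2000-02': [-8450.0],
})
# Every non-id column becomes a (Date, Value) pair in the long frame
long = pd.melt(wide, id_vars=['Business Unit', 'Category'],
               var_name='Date', value_name='Value')
print(long.shape)  # (2, 4)
```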
df = df_melted.copy()
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# Transforming categorical attributes to numerical
df['State_y'] = encoder.fit_transform(df['State_y'])
df['Category'] = encoder.fit_transform(df['Category'])
# Convert the date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Optionally, extract features from the date (like year, month, day, etc.)
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
X = df[['year', 'month', 'State_y', 'Year Constructed', 'Near Water','Business Unit','Category']]
y = df['Value']
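`LabelEncoder` assigns integer codes in alphabetical order of the class labels. A quick sketch (toy state names); note that scikit-learn intends `LabelEncoder` for targets, so `OrdinalEncoder` or one-hot encoding is the usual choice for feature columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['Pennsylvania', 'Florida', 'Pennsylvania'])
enc = LabelEncoder()
codes = enc.fit_transform(s)
# Alphabetical order: Florida -> 0, Pennsylvania -> 1
print(list(codes))  # [1, 0, 1]
```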
# TimeSeriesSplit respects temporal ordering when splitting, which suits our data
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LinearRegression
We used MSE (via `neg_mean_squared_error`) to score the models. It is hard to judge absolute model quality from these scores alone, but the lower the magnitude, the better.
tscv = TimeSeriesSplit(n_splits=10)
# Create your machine learning model (e.g., Linear Regression)
model = LinearRegression()
# Perform cross-validation
scores = cross_val_score(model, X=X, y=y, cv=tscv, scoring='neg_mean_squared_error')
formatted_scores = [f'{score:.2f}' for score in scores]
# Print the cross-validation scores
print("Cross-validation scores:", formatted_scores)
Cross-validation scores: ['-48087238.06', '-97633884.03', '-172249454.47', '-172575673.78', '-209691621.38', '-223197640.65', '-238692090.66', '-675365995.26', '-689415896.40', '-836801141.05']
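Since `neg_mean_squared_error` returns the negated MSE, taking the square root of its negation gives RMSE in the same units as the expenses, which is easier to interpret. A sketch using the first and last scores printed above:

```python
import numpy as np

# neg_mean_squared_error is negative MSE; negate, then take sqrt for RMSE
neg_mse_scores = np.array([-48087238.06, -836801141.05])
rmse = np.sqrt(-neg_mse_scores)
print(rmse)
```

This puts the typical prediction error on the order of thousands to tens of thousands of dollars per month.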
# Plotting the values of the models to get more visual information about how good is our model.
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Assume X and y are defined
# Create a TimeSeriesSplit iterator
def train_and_plot_model(model,X,y):
tscv = TimeSeriesSplit(n_splits=5)
model = model
# Storing actual and predicted values for plotting
actual_values = []
predicted_values = []
count = 0
for train_index, test_index in tscv.split(X):
count = count+1
print(count)
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
actual_values.extend(y_test)
predicted_values.extend(predictions)
# Plotting the actual vs predicted values
plt.figure(figsize=(10, 6))
plt.plot(actual_values, label='Actual Values')
plt.plot(predicted_values, label='Predicted Values', alpha=0.7)
plt.title('Actual vs Predicted Values')
plt.xlabel('Time Step')
plt.ylabel('Target Variable')
plt.legend()
plt.show()
train_and_plot_model(LinearRegression(),X,y)
train_and_plot_model(RandomForestRegressor(),X,y)
train_and_plot_model(GradientBoostingRegressor(),X,y)
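The reason `TimeSeriesSplit` suits this data can be verified directly: in every fold, all training indices precede all test indices, so the model never trains on the future. A small self-contained check:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index: no leakage from the future
    assert train_idx.max() < test_idx.min()
print("no future leakage in any fold")
```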
In the plots, Random Forest shows the best performance in predicting the expenses. Given further optimization through hyperparameter tuning, or more accurate data, this model is expected to become a very good estimator of the expenses.